-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pytx] Implement a new cleaner PDQ index solution #1698
Conversation
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far!
Toplevel comments:
- I am not convinced yet that this indexing solution is correct yet. To help convince me, you can write a unittest that generates two large sets of random hashes (PDQSignal.get_random_signal()) b and q (e.g. 1,000 and 10,000), uses the brute force method to check exactly how many of hashes in b should match the hashes in q, then loads the b hashes into your indexing solution, and then queries all of the q hashes and makes sure you exactly get the expected results.
- Many of your existing tests are tautological or should check to make sure the matching hashes are actually the ones that should be matching. You can likely replace many of your existing tests with variations on 2 above - I put some ideas for you in the IndexRecord class, but there are others that might work.
- I recommend adding a method that can return a random hash exactly X bits different from a given hash, and then using that to help in your tests.
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
def add(self, pdq_strings: t.Sequence[str]) -> None: | ||
""" | ||
Add PDQ hashes to the FAISS index. | ||
Args: | ||
pdq_strings (Sequence[str]): PDQ hash strings to add | ||
""" | ||
vectors = self._convert_pdq_strings_to_ndarray(pdq_strings) | ||
self.faiss_index.add(vectors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It's somewhat confusing that this is add
which takes multiple here, but the other is add
which takes singular.
ignorable: I think you can just move this functionality to PDQIndex2
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
# Create a near-match by flipping a few bits | ||
base_hash = pdq_hashes[0] | ||
near_hash = hex(int(base_hash, 16) ^ 0xF)[2:].zfill(64) # Flip 4 bits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Did you forget to query near_hash?
ignorable: In another similar project, I wrote a helper on the signal type itself to provide a random hash that was X distance away from a given hash by flipping random bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I query the near_hash on the line below
python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py
Outdated
Show resolved
Hide resolved
pdq_hashes = [ | ||
"f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22", | ||
"f" * 64, | ||
"0" * 64, | ||
"a" * 64, | ||
] | ||
index = PDQIndex2[str]( | ||
entries=[ | ||
("f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22", 0) | ||
] | ||
) | ||
return index, pdq_hashes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Why do you return hashes which aren't in the index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so that I can use the other hashes to test for unmatch when needed.
For example, in the test: test_one_entry_sample_index
python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! We're nearly there, only a few more blocking comments!
python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py
Outdated
Show resolved
Hide resolved
index = PDQIndex2() | ||
for i, base_hash in enumerate(base_hashes): | ||
index.add(base_hash, i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: PDQIndex2.build()
python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py
Show resolved
Hide resolved
from threatexchange.signal_type.pdq.pdq_utils import ( | ||
BITS_IN_PDQ, | ||
convert_pdq_strings_to_ndarray, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/pdq/pdq_utils.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So close!
- The way you are seeding your hashes is causing duplicates in your query set - see inline blocking comment
- As part of your test plan (doesn't have to be part of the test itself), please try generate a very large number of hashes and queries - e.g. 10,000 banked and 100,000 queries - something that takes 5-10 seconds to check. We want to be certain that the index is correct!
- Please add a test that serializes and deserializes a non-empty index - you can do something cheeky like:
original = ... # do stuff to prepare index
deserialized = PDQIndex2,deserialize(index.serialize(...))
for index in (original, deserialized):
# The original test goes here, should get same results on original or deserialized
def _generate_sample_hashes(size: int, seed: int = 42): | ||
random.seed(seed) | ||
return [PdqSignal.get_random_signal() for _ in range(size)] | ||
|
||
|
||
SAMPLE_HASHES = _generate_sample_hashes(100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works for me!
from threatexchange.signal_type.pdq.pdq_utils import simple_distance | ||
|
||
|
||
def _generate_sample_hashes(size: int, seed: int = 42): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry what do you mean by the image:P ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
42 is the "Answer to the Ultimate Question of Life, The Universe, and Everything" in the book series "The Hitchhiker's Guide to the Galaxy", of which the image is on the cover. I thought you were a fan :P
Read more: https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy
python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py
Outdated
Show resolved
Hide resolved
@Dcallies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All blocking comments resolved!
Summary
Resolve issue #1613 .
This PR introduces a new PDQIndex2 class for managing and querying a PDQ hash-based index using FAISS. Key changes include:
Test Plan
I have included several test cases for this, currently all in one file.
For the bug in issue #1318, I have created 2 test cases relating to it:
Also ran
test_pdq_index
with 1,000 banked & 10,000 queries in the unit test. The test passed but it took 4 minutes to run.And if i tried with 10,000 banked & 100,000 queries, the test took forever.